Heart disease is the leading cause of death globally; the WHO's 2019 estimates attribute about 16% of total deaths to ischaemic heart disease alone. Heart attacks and other heart-related conditions also appear to be rising among younger adults, from age 18 to the late 30s, possibly because of lifestyle patterns, dietary habits, and other factors, so heart disease can no longer be treated as a concern only for people above 70 or 80. It is therefore important to analyze a person's lifestyle and to use exploratory data analysis (EDA) to identify the key factors that significantly affect the likelihood of developing heart disease.
SMART QUESTION: Given a dataset containing information about a person's lifestyle, can we use exploratory data analysis to identify key factors, such as age, height, weight, and smoking history, that significantly affect the likelihood of developing health issues like heart disease?
Our end goal is to build a predictive model that can accurately predict the likelihood of developing heart disease based on a person's lifestyle.
We took the dataset from Kaggle (https://www.kaggle.com/datasets/alphiree/cardiovascular-diseases-risk-prediction-dataset/data). It contains approximately 300,000 samples and 19 features, 12 categorical and 7 numerical. Our target variable is Heart_Disease.
Variable Introduction
Age: This is the age of the patient. Age is a crucial factor in disease prognosis as the risk of chronic conditions such as heart disease, cancer, diabetes, and arthritis increases with age. This is due to various factors including the cumulative effect of exposure to risk factors, increased wear and tear on the body, and changes in the body’s physiological functions.
Sex: This feature represents the gender of the patient. Gender can influence disease prognosis due to biological differences and gender-specific lifestyle patterns. For instance, heart disease is more common in males, while skin cancer is more common in females. This could be due to factors like longer life expectancy or different exposure to risk factors in each gender.
General_Health: This is a self-rated health status of the patient. Patients who perceive their health as “Poor” or “Fair” are more likely to have chronic conditions. This could be because the symptoms or management of these conditions impact their perceived health status.
Checkup: This feature represents the frequency of health checkups. Regular health checkups can help in early detection and management of diseases, thereby improving the prognosis.
Exercise: This feature indicates whether the patient exercises regularly or not. Regular exercise can help control weight, reduce risk of heart diseases, and manage blood sugar and insulin levels, among other benefits. This aligns with the negative correlation observed between exercise and diseases such as heart disease, diabetes, and arthritis.
Smoking_History: This feature indicates whether the patient has a history of smoking. Smoking can increase disease risk as it can damage blood vessels, increase blood pressure, and reduce the amount of oxygen reaching the organs.
These features collectively provide a comprehensive profile of the patient, incorporating demographic factors, health conditions, and lifestyle habits that are all known to influence disease prognosis. The model trained on these features thus has the potential to provide accurate disease predictions based on a wide range of factors.
library(tidyverse); library(skimr)  # ggplot2, dplyr, tidyr, and skim() are used below
df <- read.csv('CVD.csv')
Below is a summary of the data frame produced by the skim function from the skimr library. It shows that the dataset has no missing values. For the 12 character variables it reports the number of unique values; for the 7 numeric variables it reports the mean, standard deviation, and the 0th, 25th, 50th (median), 75th, and 100th percentiles, along with a small histogram.
skim(df)
| Name | df |
| Number of rows | 308854 |
| Number of columns | 19 |
| _______________________ | |
| Column type frequency: | |
| character | 12 |
| numeric | 7 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| General_Health | 0 | 1 | 4 | 9 | 0 | 5 | 0 |
| Checkup | 0 | 1 | 5 | 23 | 0 | 5 | 0 |
| Exercise | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| Heart_Disease | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| Skin_Cancer | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| Other_Cancer | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| Depression | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| Diabetes | 0 | 1 | 2 | 42 | 0 | 4 | 0 |
| Arthritis | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| Sex | 0 | 1 | 4 | 6 | 0 | 2 | 0 |
| Age_Category | 0 | 1 | 3 | 5 | 0 | 13 | 0 |
| Smoking_History | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Height_.cm. | 0 | 1 | 170.62 | 10.66 | 91.00 | 163.00 | 170.00 | 178.00 | 241.00 | ▁▁▇▂▁ |
| Weight_.kg. | 0 | 1 | 83.59 | 21.34 | 24.95 | 68.04 | 81.65 | 95.25 | 293.02 | ▇▇▁▁▁ |
| BMI | 0 | 1 | 28.63 | 6.52 | 12.02 | 24.21 | 27.44 | 31.85 | 99.33 | ▇▅▁▁▁ |
| Alcohol_Consumption | 0 | 1 | 5.10 | 8.20 | 0.00 | 0.00 | 1.00 | 6.00 | 30.00 | ▇▁▁▁▁ |
| Fruit_Consumption | 0 | 1 | 29.84 | 24.88 | 0.00 | 12.00 | 30.00 | 30.00 | 120.00 | ▇▆▃▁▁ |
| Green_Vegetables_Consumption | 0 | 1 | 15.11 | 14.93 | 0.00 | 4.00 | 12.00 | 20.00 | 128.00 | ▇▂▁▁▁ |
| FriedPotato_Consumption | 0 | 1 | 6.30 | 8.58 | 0.00 | 2.00 | 4.00 | 8.00 | 128.00 | ▇▁▁▁▁ |
The structure of the dataset after converting the categorical variables to factors is shown below. General_Health, Checkup, Sex, Age_Category, Smoking_History, Heart_Disease, Skin_Cancer, Other_Cancer, Arthritis, Depression, Diabetes, and Exercise are all converted to factors with levels.
# 'Sex' to a factor
df$Sex <- factor(df$Sex)
df$General_Health <- factor(df$General_Health)
df$Checkup <- factor(df$Checkup)
# 'Age_Category' to a factor
df$Age_Category <- factor(df$Age_Category)
# 'Smoking_History' to a factor
df$Smoking_History <- factor(df$Smoking_History)
# 'Heart_Disease' to a factor
df$Heart_Disease <- factor(df$Heart_Disease)
# 'Skin_Cancer' to a factor
df$Skin_Cancer <- factor(df$Skin_Cancer)
# 'Other_Cancer' to a factor
df$Other_Cancer <- factor(df$Other_Cancer)
# 'Arthritis' to a factor
df$Arthritis <- factor(df$Arthritis)
# 'Depression' to a factor
df$Depression <- factor(df$Depression)
# 'Diabetes' to a factor
df$Diabetes <- factor(df$Diabetes)
# 'Exercise' to a factor
df$Exercise <- factor(df$Exercise)
str(df)
## 'data.frame': 308854 obs. of 19 variables:
## $ General_Health : Factor w/ 5 levels "Excellent","Fair",..: 4 5 5 4 3 3 2 3 2 2 ...
## $ Checkup : Factor w/ 5 levels "5 or more years ago",..: 3 5 5 5 5 5 5 5 5 5 ...
## $ Exercise : Factor w/ 2 levels "No","Yes": 1 1 2 2 1 1 2 2 1 1 ...
## $ Heart_Disease : Factor w/ 2 levels "No","Yes": 1 2 1 2 1 1 2 1 1 1 ...
## $ Skin_Cancer : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ Other_Cancer : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ Depression : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 2 1 ...
## $ Diabetes : Factor w/ 4 levels "No","No, pre-diabetes or borderline diabetes",..: 1 3 3 3 1 1 1 1 1 3 ...
## $ Arthritis : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 2 2 2 1 2 ...
## $ Sex : Factor w/ 2 levels "Female","Male": 1 1 1 2 2 2 2 1 1 1 ...
## $ Age_Category : Factor w/ 13 levels "18-24","25-29",..: 11 11 9 12 13 9 9 10 10 11 ...
## $ Height_.cm. : num 150 165 163 180 191 183 175 165 163 163 ...
## $ Weight_.kg. : num 32.7 77.1 88.5 93.4 88.5 ...
## $ BMI : num 14.5 28.3 33.5 28.7 24.4 ...
## $ Smoking_History : Factor w/ 2 levels "No","Yes": 2 1 1 1 2 1 2 2 2 1 ...
## $ Alcohol_Consumption : num 0 0 4 0 0 0 0 3 0 0 ...
## $ Fruit_Consumption : num 30 30 12 30 8 12 16 30 12 12 ...
## $ Green_Vegetables_Consumption: num 16 0 3 30 4 12 8 8 12 12 ...
## $ FriedPotato_Consumption : num 12 4 16 8 0 12 0 8 4 1 ...
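The twelve separate factor() calls above can also be written as one step. The following is a minimal base R sketch on a small made-up data frame (the `toy` object and its values are assumptions for illustration, not rows from the dataset); it converts every character column to a factor and leaves numeric columns untouched.

```r
# Toy stand-in for the real data (hypothetical values, same column style)
toy <- data.frame(
  Sex = c("Female", "Male", "Male"),
  Exercise = c("Yes", "No", "Yes"),
  BMI = c(22.5, 31.0, 27.4),
  stringsAsFactors = FALSE
)

# Find the character columns, then convert all of them to factors at once
chr_cols <- vapply(toy, is.character, logical(1))
toy[chr_cols] <- lapply(toy[chr_cols], factor)

str(toy)
```

Applied to `df`, the same two lines would convert all 12 character columns, assuming none of them should remain plain text.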
We found 80 duplicate rows in the dataset and removed them using the unique function in R.
dfDuplicate <- sum(duplicated(df))
cat("There were ", dfDuplicate," Duplicate Values", "\n")
## There were 80 Duplicate Values
df <- unique(df)
# Univariate Analysis
## Distribution of numeric variables
# Visualization of data distribution
numerical_features <- c('Height_.cm.', 'Weight_.kg.', 'BMI',
'Alcohol_Consumption', 'Fruit_Consumption',
'Green_Vegetables_Consumption', 'FriedPotato_Consumption')
# Loop through the numerical features and create histograms with density plots
for (feature in numerical_features) {
  # Create a ggplot object for each feature; .data[[feature]] replaces the deprecated aes_string()
  plot <- ggplot(df, aes(x = .data[[feature]])) +
    geom_histogram(binwidth = 1, fill = "blue", color = "black") +  # Create a histogram
    geom_density(aes(y = after_stat(count))) +  # Density scaled to counts (matches binwidth = 1)
    ggtitle(paste("Distribution of", feature)) +
    xlab(feature) + ylab("Count") +
    theme_minimal()  # Minimal theme
  # Display the plot
  print(plot)
}
Interpretation of results:
Height_.cm.: Height appears to follow an approximately normal distribution, with the majority of patients between 160 and 180 cm.
Weight_.kg.: Weight is also roughly bell-shaped, with most patients between about 60 kg and 100 kg, but it has a long right tail (the maximum in the summary above is about 293 kg).
BMI: The distribution of Body Mass Index is somewhat right-skewed. A large number of patients have a BMI between 20 and 30, which falls within the normal to overweight range. However, a significant number of patients have a BMI in the obese range (>30).
Alcohol_Consumption: This feature is heavily right-skewed. Most patients have very low alcohol consumption, but a few patients have high alcohol consumption.
Fruit_Consumption: This feature also seems to be right-skewed. Many patients consume fruit regularly, but a significant number consume it less regularly.
Green_Vegetables_Consumption: This feature is also right-skewed rather than normal (median 12, maximum 128 in the summary above). Most patients report moderate consumption of green vegetables, and a few report very high values.
FriedPotato_Consumption: This feature seems to be right-skewed. Many patients consume fried potatoes infrequently, while a few consume them more often.
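The skewness described above is read off the histograms by eye; a number can back it up. Below is a minimal sketch of a moment-based skewness function (the helper name `skewness` and the simulated data are assumptions for illustration). Values near 0 indicate a symmetric distribution, and positive values indicate a longer right tail.

```r
# Moment-based sample skewness: the mean of the cubed z-scores
# (uses the population form of the standard deviation)
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  s <- sqrt(mean((x - m)^2))
  mean(((x - m) / s)^3)
}

# Simulated examples: a symmetric sample and a right-skewed sample
set.seed(1)
symmetric <- rnorm(1000)
right_skewed <- rexp(1000)

c(symmetric = skewness(symmetric), right_skewed = skewness(right_skewed))
```

On the real data, something like `sapply(df[numerical_features], skewness)` would give one value per numeric variable.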
# Define the list of categorical features
categorical_features <- c('General_Health', 'Checkup', 'Exercise', 'Heart_Disease',
'Skin_Cancer', 'Other_Cancer', 'Depression', 'Diabetes',
'Arthritis', 'Sex', 'Age_Category', 'Smoking_History')
# Loop through the categorical features and create count plots
for (feature in categorical_features) {
  # Create a ggplot object for each feature; .data[[feature]] replaces the deprecated aes_string()
  plot <- ggplot(df, aes(x = .data[[feature]])) +
    geom_bar() +  # Create a bar plot
    ggtitle(paste("Count of", feature)) +
    xlab(feature) +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +  # Rotate x-axis labels
    # Label each bar with its share of all rows; after_stat() replaces the deprecated ..count..
    geom_text(stat = 'count', aes(label = after_stat(scales::percent(count / sum(count)))),
              vjust = 1.6, color = "black", size = 3)
  # Display the plot
  print(plot)
}
General_Health: Most patients describe their general health as “Very Good”, with “Good” being the second most common response. Few patients rate their health as “Fair” or “Poor”.
Checkup: The majority of patients had a checkup within the past year. Fewer patients had their last checkup 2 years ago or more than 5 years ago.
Exercise: 78% of the people in our data set exercise regularly.
Heart_Disease: 92% of the people in our data set do not have heart disease.
Skin_Cancer: The vast majority of patients, 90% of our population, do not have skin cancer.
Other_Cancer: As with skin cancer, the vast majority of patients do not have any other kind of cancer.
Depression: Most patients do not have depression; however, a small proportion, around 20% of our data set, do report having depression.
Diabetes: Most patients do not have diabetes, in line with the other disease-related features above; only a small percentage of the population is affected.
Arthritis: The majority of patients, 67% of our data set, do not have arthritis, but a significant number do.
Sex: The data set is well balanced by sex, with 48.1% males and 51.9% females.
Age_Category: Patients from a wide range of ages are included in the data set. The 65-69 age group has the most patients, followed by the 60-64 group.
Smoking_History: 41% of the people in our data set have a history of smoking and 59% do not. (NOTE: this variable records smoking history, not current smoking, so we do not know whether these people were chain smokers or just casual smokers.)
# Viz Heart Disease and Smoking History
ggplot(df, aes(x = Smoking_History, fill = Heart_Disease)) +
geom_bar(position = "fill") +
labs(title = "Smoking history vs. Heart_Disease")
The graph above shows an association between smoking history and
heart disease: the proportion of people with heart disease is higher
among those with a smoking history than among those who have never
smoked, although people with heart disease are still a minority in
both groups. This is consistent with the known effects of smoking,
which damages the heart and blood vessels and makes them more
susceptible to disease.
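A filled bar chart like this one plots row-wise proportions, and the same numbers can be printed as a table. The sketch below uses a small made-up data frame (the `toy` values are assumptions, not taken from the dataset) to show how `table()` and `prop.table()` give the share of heart disease within each smoking group.

```r
# Hypothetical toy data: 4 people with a smoking history, 6 without
toy <- data.frame(
  Smoking_History = c("Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No", "No", "No"),
  Heart_Disease   = c("Yes", "Yes", "No", "No", "Yes", "No", "No", "No", "No", "No")
)

# Cross-tabulate: rows are smoking groups, columns are heart disease status
tab <- table(toy$Smoking_History, toy$Heart_Disease)

# margin = 1 makes each row (smoking group) sum to 1
rates <- prop.table(tab, margin = 1)
round(rates, 2)
```

With the real data, `prop.table(table(df$Smoking_History, df$Heart_Disease), margin = 1)` would give the exact percentages behind the bars.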
ggplot(df, aes(x = Age_Category, fill = Heart_Disease)) +
geom_bar(position = "fill") +
labs(title = "Age_Category vs. Heart_Disease")
The graph shows the percentage of people who have heart disease by age
category. The percentage is higher in the older age categories than in
the younger ones, likely because several other health-related risk
factors accumulate with age.
ggplot(df, aes(x = Other_Cancer, fill = Heart_Disease)) +
geom_bar(position = "fill") +
labs(title = "Other_Cancer vs. Heart_Disease")
The graph shows that people who have had other cancers have a slightly higher percentage of heart disease than people who have not. From this plot alone we cannot confirm a relationship between the two, but we cannot rule one out either.
ggplot(df, aes(x = Sex, fill = Heart_Disease)) +
geom_bar(position = "fill") +
labs(title = "Sex vs. Heart_Disease")
From the graph we can observe that males have a slightly higher
percentage of heart disease than females. Here too, the plot alone does
not let us conclude that there is a relationship between sex and heart
disease, so we should check the association between them with other
methods.
ggplot(df, aes(x = Skin_Cancer, fill = Heart_Disease)) +
geom_bar(position = "fill") +
labs(title = "Skin_Cancer vs. Heart_Disease")
As with other cancers, the plot for skin cancer shows a similar
distribution in the percentage of people with heart disease.
ggplot(df, aes(x = Heart_Disease, y = Alcohol_Consumption, fill = Heart_Disease)) +
geom_boxplot() +
labs(title = "Alcohol Consumption vs. Heart Disease") +
ylab("Alcohol Consumption")
ggplot(df, aes(x = Heart_Disease, y = BMI, fill = Heart_Disease)) +
geom_boxplot() +
labs(title = "BMI vs. Heart Disease") +
ylab("BMI")
ggplot(df, aes(x = Heart_Disease, y = FriedPotato_Consumption, fill = Heart_Disease)) +
geom_boxplot() +
labs(title = "Fried Potato Consumption vs. Heart Disease") +
ylab("Fried Potato Consumption")
The distribution of fried potato consumption is almost the same for
people who have heart disease and people who do not. This is surprising,
because fried food is commonly regarded as a risk factor for heart
disease, so we revisit this relationship later in our analysis. The plot
also shows many outliers, but we do not remove them because we are not
sure they are true outliers rather than valid observations.
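The points drawn beyond the whiskers of a box plot follow the usual 1.5 × IQR rule. The sketch below shows that rule as a small helper (the name `iqr_outliers` is an assumption; ggplot2 computes its hinges slightly differently from `quantile()`, so counts can differ a little at the margins).

```r
# Flag values outside [Q1 - k * IQR, Q3 + k * IQR]; k = 1.5 is the conventional choice
iqr_outliers <- function(x, k = 1.5) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  fence <- k * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# Toy example: only the value 100 lies beyond the upper fence
x <- c(1, 2, 3, 4, 5, 100)
x[iqr_outliers(x)]
```

Using the quartiles reported in the summary table for FriedPotato_Consumption (2 and 8), the upper fence is 8 + 1.5 × 6 = 17, so every value above 17 is drawn as an outlier even though such values may be perfectly valid.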
ggplot(df, aes(x = factor(Heart_Disease), y = Fruit_Consumption, fill = Heart_Disease)) +
geom_boxplot() +
labs(title = "Fruit Consumption vs. Heart Disease") +
ylab("Fruit Consumption")
The graph above shows a trend similar to Fried Potato Consumption vs.
Heart Disease: the distribution of fruit consumption is almost the same
for people who have heart disease and people who do not.
ggplot(df, aes(x = BMI, fill = Heart_Disease)) +
geom_density(alpha = 0.5) +
labs(title = "BMI Density by Heart_Disease", x = "BMI") +
scale_fill_manual(values = c("No" = "blue", "Yes" = "red"))
ggplot(df, aes(x = Heart_Disease, y = BMI, fill = Heart_Disease)) +
geom_violin() +
labs(title = "BMI vs. Heart_Disease", x = "Heart_Disease", y = "BMI")
# Histograms of BMI by Heart_Disease
ggplot(df, aes(x = BMI, fill = Heart_Disease)) +
  geom_histogram(bins = 30) +  # Set bins explicitly (30 is the default) to avoid the stat_bin() message
  labs(title = "Histograms of BMI by Heart_Disease", x = "BMI", y = "Frequency") +
  scale_fill_manual(values = c("No" = "blue", "Yes" = "red"))
# Box Plots for Height and Weight vs. Heart_Disease
boxplot_data <- df %>%
select(Heart_Disease, Height_.cm., Weight_.kg.)
boxplot_data_long <- boxplot_data %>%
pivot_longer(cols = -Heart_Disease, names_to = "Variable", values_to = "Value")
p <- ggplot(boxplot_data_long, aes(x = Heart_Disease, y = Value, fill = Heart_Disease)) +
geom_boxplot() +
facet_wrap(~Variable, scales = "free_y") +
labs(title = "Height and Weight vs. Heart_Disease")
print(p)
In the box plots above we could not observe any major relationship
between heart disease and height or weight.
# Hypothesis Testing
## Chi-squared test
# List of categorical variables to be check for relation with Heart_Disease
categorical_variables <- c("Skin_Cancer", "Other_Cancer", "Depression", "Diabetes", "Arthritis", "Sex", "Age_Category", "Exercise", "Smoking_History")
# chi-squared test for each categorical variable
for (var in categorical_variables) {
chi_square_result <- chisq.test(df$Heart_Disease, df[[var]])
cat("\nChi-squared Test: ", var, "vs. Heart_Disease:\n")
print(chi_square_result)
}
##
## Chi-squared Test: Skin_Cancer vs. Heart_Disease:
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: df$Heart_Disease and df[[var]]
## X-squared = 2546.6, df = 1, p-value < 2.2e-16
##
##
## Chi-squared Test: Other_Cancer vs. Heart_Disease:
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: df$Heart_Disease and df[[var]]
## X-squared = 2633.3, df = 1, p-value < 2.2e-16
##
##
## Chi-squared Test: Depression vs. Heart_Disease:
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: df$Heart_Disease and df[[var]]
## X-squared = 325.72, df = 1, p-value < 2.2e-16
##
##
## Chi-squared Test: Diabetes vs. Heart_Disease:
##
## Pearson's Chi-squared test
##
## data: df$Heart_Disease and df[[var]]
## X-squared = 10414, df = 3, p-value < 2.2e-16
##
##
## Chi-squared Test: Arthritis vs. Heart_Disease:
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: df$Heart_Disease and df[[var]]
## X-squared = 7311.3, df = 1, p-value < 2.2e-16
##
##
## Chi-squared Test: Sex vs. Heart_Disease:
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: df$Heart_Disease and df[[var]]
## X-squared = 1627.2, df = 1, p-value < 2.2e-16
##
##
## Chi-squared Test: Age_Category vs. Heart_Disease:
##
## Pearson's Chi-squared test
##
## data: df$Heart_Disease and df[[var]]
## X-squared = 18033, df = 12, p-value < 2.2e-16
##
##
## Chi-squared Test: Exercise vs. Heart_Disease:
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: df$Heart_Disease and df[[var]]
## X-squared = 2863.9, df = 1, p-value < 2.2e-16
##
##
## Chi-squared Test: Smoking_History vs. Heart_Disease:
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: df$Heart_Disease and df[[var]]
## X-squared = 3584.5, df = 1, p-value < 2.2e-16
### Note: a p-value below 0.05 indicates a statistically significant association between the two variables.
### All p-values above are below 2.2e-16, far under 0.05, so every categorical variable tested is significantly associated with Heart_Disease.
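With roughly 300,000 rows, even weak associations produce tiny p-values, so it can help to report an effect size next to each chi-squared test. The sketch below computes Cramér's V, which ranges from 0 (no association) to 1 (perfect association). The helper name `cramers_v` is an assumption; it uses the uncorrected Pearson statistic.

```r
# Cramér's V: sqrt(chi-squared / (n * (min(rows, cols) - 1)))
cramers_v <- function(x, y) {
  tab <- table(x, y)
  # correct = FALSE: use the uncorrected Pearson statistic for the effect size
  chi2 <- unname(suppressWarnings(chisq.test(tab, correct = FALSE))$statistic)
  n <- sum(tab)
  k <- min(nrow(tab), ncol(tab)) - 1
  sqrt(chi2 / (n * k))
}

# Toy checks: perfectly associated vs. independent variables
perfect <- cramers_v(c("a", "a", "b", "b"), c("u", "u", "v", "v"))
independent <- cramers_v(c("a", "a", "b", "b"), c("u", "v", "u", "v"))
c(perfect = perfect, independent = independent)
```

As a rough illustration, plugging the Yates-corrected statistic reported above for Smoking_History (3584.5) and n = 308,774 into the same formula gives V of about 0.11, a statistically significant but modest association.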
1. Objective: To test whether the mean “Fruit_Consumption” differs significantly between individuals with and without heart disease.
Null Hypothesis (H0): The mean “Fruit_Consumption” is the same for individuals with and without heart disease.
Alternative Hypothesis (Ha): The mean “Fruit_Consumption” differs between individuals with and without heart disease.
t_test_result <- t.test(Fruit_Consumption ~ Heart_Disease, data = df)
print(t_test_result)
##
## Welch Two Sample t-test
##
## data: Fruit_Consumption by Heart_Disease
## t = 11.32, df = 29722, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
## 95 percent confidence interval:
## 1.512380 2.145763
## sample estimates:
## mean in group No mean in group Yes
## 29.98221 28.15314
Observations:
* The t-test results show that there is a significant difference in “Fruit_Consumption” between people with and without heart disease.
* The extremely low p-value indicates that there is strong evidence to reject the null hypothesis (H0).
* The 95% confidence interval, which excludes zero, adds to the rejection of the null hypothesis.
* Individuals without heart disease consume more fruit than those with heart disease.
2. Objective: To test whether the mean “Alcohol_Consumption” differs significantly between individuals with and without heart disease.
Null Hypothesis (H0): The mean “Alcohol_Consumption” is the same for individuals with and without heart disease.
Alternative Hypothesis (Ha): The mean “Alcohol_Consumption” differs between individuals with and without heart disease.
t_test_result2 <- t.test(Alcohol_Consumption ~ Heart_Disease, data = df)
print(t_test_result2)
##
## Welch Two Sample t-test
##
## data: Alcohol_Consumption by Heart_Disease
## t = 20.475, df = 29602, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
## 95 percent confidence interval:
## 0.9958565 1.2067033
## sample estimates:
## mean in group No mean in group Yes
## 5.186619 4.085339
Conclusion:
* The t-test results show a significant difference in “Alcohol_Consumption” between people with and without heart disease.
* The extremely low p-value indicates that there is strong evidence to reject the null hypothesis (H0).
* The 95% confidence interval, which excludes zero, adds to the rejection of the null hypothesis.
* Individuals without heart disease consume more alcohol on average than those with heart disease.
3. Objective: To test whether the mean “Green_Vegetables_Consumption” differs significantly between individuals with and without heart disease.
Null Hypothesis (H0): The mean “Green_Vegetables_Consumption” is the same for individuals with and without heart disease.
Alternative Hypothesis (Ha): The mean “Green_Vegetables_Consumption” differs between individuals with and without heart disease.
t_test_result3 <- t.test(Green_Vegetables_Consumption ~ Heart_Disease, data = df)
print(t_test_result3)
##
## Welch Two Sample t-test
##
## data: Green_Vegetables_Consumption by Heart_Disease
## t = 14.186, df = 30275, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
## 95 percent confidence interval:
## 1.133736 1.497262
## sample estimates:
## mean in group No mean in group Yes
## 15.2159 13.9004
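Conclusion:
* The t-test results show a significant difference in “Green_Vegetables_Consumption” between people with and without heart disease.
* The extremely low p-value (< 2.2e-16) provides strong evidence to reject the null hypothesis (H0).
* The 95% confidence interval (1.13 to 1.50) excludes zero, which further supports rejecting the null hypothesis.
* On average, individuals without heart disease consume more green vegetables (15.22) than those with heart disease (13.90).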
4. Objective: To test whether the mean “FriedPotato_Consumption” differs significantly between individuals with and without heart disease.
Null Hypothesis (H0): The mean “FriedPotato_Consumption” is the same for individuals with and without heart disease.
Alternative Hypothesis (Ha): The mean “FriedPotato_Consumption” differs between individuals with and without heart disease.
t_test_result4 <- t.test(FriedPotato_Consumption ~ Heart_Disease, data = df)
print(t_test_result4)
##
## Welch Two Sample t-test
##
## data: FriedPotato_Consumption by Heart_Disease
## t = 5.1817, df = 29631, p-value = 2.213e-07
## alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
## 95 percent confidence interval:
## 0.1810443 0.4013388
## sample estimates:
## mean in group No mean in group Yes
## 6.320786 6.029594
Conclusion:
* The t-test results indicate a significant difference in “FriedPotato_Consumption” between individuals with and without heart disease.
* The small p-value (2e-07) suggests strong evidence to reject the null hypothesis (H0).
* The 95% confidence interval, which does not include 0, further supports the rejection of the null hypothesis.
* On average, individuals without heart disease consume slightly more fried potatoes than those with heart disease.
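All four t-tests are significant, but with a sample this large the size of each difference matters as much as its p-value. The sketch below computes Cohen's d, the difference in group means divided by the pooled standard deviation (the helper name `cohens_d` and the toy data are assumptions for illustration).

```r
# Cohen's d for a numeric vector x split by a two-level grouping variable g
cohens_d <- function(x, g) {
  g <- factor(g)
  stopifnot(nlevels(g) == 2)
  a <- x[g == levels(g)[1]]
  b <- x[g == levels(g)[2]]
  # Pooled standard deviation, weighting each group's variance by its degrees of freedom
  pooled_sd <- sqrt(((length(a) - 1) * var(a) + (length(b) - 1) * var(b)) /
                      (length(a) + length(b) - 2))
  (mean(a) - mean(b)) / pooled_sd
}

# Toy example: group means 2 and 4, both variances 1, so d = -2
d <- cohens_d(c(1, 2, 3, 3, 4, 5), c("No", "No", "No", "Yes", "Yes", "Yes"))
d
```

As a rough check using numbers already reported for Fruit_Consumption, the mean difference of about 1.83 divided by the overall standard deviation of 24.88 (used here as an approximation of the pooled value) gives d of roughly 0.07, which is conventionally considered a very small effect.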
df_yes <- subset(df, Heart_Disease == "Yes")
df_no <- subset(df, Heart_Disease == "No")
top_1000_df_yes <- head(df_yes, 1000)
top_1000_df_no <- head(df_no, 1000)
merged_df <- rbind(top_1000_df_yes, top_1000_df_no)
ggplot(merged_df, aes(x = Weight_.kg., y = BMI, color = Heart_Disease)) +
geom_point() +
labs(title = "scatterplot of weight vs bmi colored by heart disease",
x = "weight",
y = "bmi") +
theme_minimal()
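Note that head() takes the first 1000 rows of each group in file order. If the file order is not random, a random sample per group may be more representative. The sketch below shows one way to draw such a sample in base R; the helper `sample_per_group`, the seed, and the toy data are assumptions for illustration.

```r
# Draw up to n random rows from each level of a grouping column
sample_per_group <- function(data, group_col, n) {
  groups <- split(data, data[[group_col]])
  sampled <- lapply(groups, function(g) {
    g[sample(nrow(g), min(n, nrow(g))), , drop = FALSE]
  })
  do.call(rbind, sampled)
}

# Toy data: 10 "Yes" rows and 40 "No" rows
set.seed(123)  # fixed seed so the sample is reproducible
toy <- data.frame(id = 1:50,
                  Heart_Disease = rep(c("Yes", "No"), times = c(10, 40)))

balanced <- sample_per_group(toy, "Heart_Disease", 5)
table(balanced$Heart_Disease)
```

On the real data, `sample_per_group(df, "Heart_Disease", 1000)` would play the role of `merged_df`.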
1. Most of the people without heart disease lie towards the lower end of the plot, which suggests that people with higher weight and higher body mass index are affected more by heart disease.
## Observations
1. We can’t find significant patterns, but people with an alcohol consumption of 0 show no signs of heart disease.
2. The majority of people with heart disease fall in the alcohol consumption range of 15-30 and also have higher weights.
1. People with no smoking history and no depression show no signs of heart disease.
2. Of smoking history and depression, smoking history seems to have more impact on a person’s heart: as seen in the plot, people with a smoking history but without depression have heart disease more often than people without a smoking history but with depression.
1. Most of the people with heart disease are males with a smoking history.
2. There are few or no observations for females without a smoking history, but a few data points fall under males without a smoking history.
# 3D Plot
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(corrplot)
## corrplot 0.92 loaded
ds <- read.csv("CVD.csv")
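# Convert an age-category label to a single number:
# a range such as "18-24" becomes its midpoint; any other label keeps only its digits
convert_age_category <- function(category) {
  if (grepl("-", category)) {
    bounds <- as.numeric(strsplit(category, "-")[[1]])  # avoid shadowing base::range
    avg <- mean(bounds)
  } else {
    avg <- as.numeric(gsub("[^0-9]", "", category))
  }
  return(avg)
}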
ds$Age_Category <- sapply(ds$Age_Category, convert_age_category)
plot_3d <- plot_ly(ds, x = ~BMI, y = ~Age_Category, z = ~Heart_Disease,
type = "scatter3d", mode = "markers",
marker = list(size = 5, color = 'green', opacity = 0.7))
plot_3d <- plot_3d %>% layout(scene = list(
xaxis = list(title = 'BMI', backgroundcolor = "rgb(200, 200, 230)"),
yaxis = list(title = 'Age_Category', backgroundcolor = "rgb(230, 200, 230)"),
zaxis = list(title = 'Heart_Disease', backgroundcolor = "rgb(200, 230, 200)")
),
margin = list(l = 0, r = 0, b = 0, t = 0))
plot_3d
variables_to_analyze <- c("Height_.cm.", "Weight_.kg.", "BMI", "Fruit_Consumption", "Green_Vegetables_Consumption", "FriedPotato_Consumption", "Alcohol_Consumption")
anova_results <- list()
for (var in variables_to_analyze) {
anova_result <- aov(reformulate("Heart_Disease", response = var), data = df)
anova_results[[var]] <- anova_result
}
# Print
for (var in variables_to_analyze) {
cat("ANOVA Test for", var, "\n")
print(summary(anova_results[[var]]))
}
## ANOVA Test for Height_.cm.
## Df Sum Sq Mean Sq F value Pr(>F)
## Heart_Disease 1 8738 8738 76.93 <2e-16 ***
## Residuals 308772 35068677 114
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## ANOVA Test for Weight_.kg.
## Df Sum Sq Mean Sq F value Pr(>F)
## Heart_Disease 1 295787 295787 650.6 <2e-16 ***
## Residuals 308772 140379554 455
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## ANOVA Test for BMI
## Df Sum Sq Mean Sq F value Pr(>F)
## Heart_Disease 1 23888 23888 562.5 <2e-16 ***
## Residuals 308772 13113494 42
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## ANOVA Test for Fruit_Consumption
## Df Sum Sq Mean Sq F value Pr(>F)
## Heart_Disease 1 76785 76785 124.1 <2e-16 ***
## Residuals 308772 191024540 619
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## ANOVA Test for Green_Vegetables_Consumption
## Df Sum Sq Mean Sq F value Pr(>F)
## Heart_Disease 1 39719 39719 178.4 <2e-16 ***
## Residuals 308772 68758826 223
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## ANOVA Test for FriedPotato_Consumption
## Df Sum Sq Mean Sq F value Pr(>F)
## Heart_Disease 1 1946 1946.1 26.41 2.76e-07 ***
## Residuals 308772 22749145 73.7
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## ANOVA Test for Alcohol_Consumption
## Df Sum Sq Mean Sq F value Pr(>F)
## Heart_Disease 1 27836 27836 414.5 <2e-16 ***
## Residuals 308772 20736260 67
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
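Because Heart_Disease has only two levels, each of these one-way ANOVAs is equivalent to a pooled-variance two-sample t-test, with F equal to t squared. The sketch below checks this on simulated data (the `toy` data frame and the seed are assumptions for illustration).

```r
# Simulated two-group data with unequal group sizes
set.seed(42)
toy <- data.frame(
  group = factor(rep(c("No", "Yes"), times = c(60, 40))),
  value = c(rnorm(60, mean = 10), rnorm(40, mean = 9))
)

# F statistic from the one-way ANOVA
f_value <- summary(aov(value ~ group, data = toy))[[1]][["F value"]][1]

# t statistic from the pooled-variance (var.equal = TRUE) t-test
t_value <- unname(t.test(value ~ group, data = toy, var.equal = TRUE)$statistic)

all.equal(f_value, t_value^2)
```

This also explains why the F values above are close to, but not exactly, the squares of the earlier t statistics: those were Welch tests, which do not pool the variances. For FriedPotato_Consumption, for example, 5.1817 squared is about 26.85, against an F value of 26.41.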